Recently, density map regression-based methods have dominated crowd counting owing to their excellent ability to fit density distributions. However, further improvement tends to saturate, mainly because of confusing background noise and large density variation. In this paper, we propose a Hierarchically Decoupled Network (HDNet) to solve these two problems within a unified framework. Specifically, a background classification sub-task is decoupled from the density map prediction task and assigned to a Density Decoupling Module (DDM) to exploit its highly discriminative ability. The remaining foreground prediction sub-task is further hierarchically decomposed by the DDM into several density-specific sub-tasks, which are then solved by regression-based experts in a Foreground Density Estimation Module (FDEM). Although this strategy effectively shrinks the hypothesis space and thus eases optimization for the task-specific experts, it ignores the high correlation among these sub-tasks. We therefore introduce three types of interaction strategies to unify the whole framework: Feature Interaction, Gradient Interaction, and Scale Interaction. Integrating the above designs, HDNet achieves state-of-the-art performance on several popular counting benchmarks.
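A minimal PyTorch sketch of the decoupling idea described above, assuming a shared backbone feature map; the module names, channel sizes, and soft routing are illustrative assumptions, not the authors' implementation. A per-pixel background classifier stands in for the DDM, and a small bank of regression experts stands in for the FDEM:

```python
import torch
import torch.nn as nn

class DecoupledCountingHead(nn.Module):
    def __init__(self, in_ch: int = 256, num_experts: int = 3):
        super().__init__()
        # Background classification sub-task: per-pixel fg/bg logit (DDM role).
        self.bg_classifier = nn.Conv2d(in_ch, 1, kernel_size=1)
        # Soft routing of each pixel to a density-specific expert.
        self.router = nn.Conv2d(in_ch, num_experts, kernel_size=1)
        # One lightweight regression expert per density range (FDEM role).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 1, 1), nn.ReLU(inplace=True),  # density >= 0
            )
            for _ in range(num_experts)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        fg_prob = torch.sigmoid(self.bg_classifier(feat))          # (B,1,H,W)
        weights = torch.softmax(self.router(feat), dim=1)          # (B,E,H,W)
        preds = torch.cat([e(feat) for e in self.experts], dim=1)  # (B,E,H,W)
        density = (weights * preds).sum(dim=1, keepdim=True)       # (B,1,H,W)
        return fg_prob * density  # suppress background pixels

head = DecoupledCountingHead()
print(head(torch.randn(2, 256, 32, 32)).shape)  # torch.Size([2, 1, 32, 32])
```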
Weakly supervised object localization (WSOL) aims to learn object localizers using only image-level labels. Techniques based on convolutional neural networks (CNNs) tend to highlight the most discriminative parts of an object while ignoring the full object extent. Recently, transformer architectures have been deployed to WSOL to capture long-range feature dependencies with the self-attention mechanism and multilayer perceptron structure. However, transformers lack the locality inductive bias inherent to CNNs and can therefore degrade local feature details in WSOL. In this paper, we propose a novel transformer-based framework, termed LCTR (Local Continuity TRansformer), which enhances the local perception capability of global features among long-range feature dependencies. To this end, we propose a relational patch-attention module (RPAM), which considers cross-patch information on a global basis. We further design a cue digging module (CDM), which utilizes local features to guide the learning trend of the model toward highlighting weak local responses. Finally, comprehensive experiments are conducted on two widely used datasets, i.e., CUB-200-2011 and ILSVRC, to verify the effectiveness of our method.
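A rough stand-in for the cross-patch relation modeling in RPAM: plain multi-head self-attention over patch tokens, which aggregates global cross-patch information into each token. This is a sketch under assumed token shapes, not the paper's exact module:

```python
import torch
import torch.nn as nn

class RelationalPatchAttention(nn.Module):
    """Self-attention over patch tokens; each token aggregates global
    cross-patch relations (a simplified stand-in for RPAM)."""
    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) patch embeddings from the transformer backbone
        x = self.norm(tokens)
        out, _ = self.attn(x, x, x)
        return tokens + out  # residual keeps local details intact

tokens = torch.randn(2, 196, 384)  # 14x14 patches of a 224x224 image
print(RelationalPatchAttention()(tokens).shape)  # torch.Size([2, 196, 384])
```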
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
We introduce SoundSpaces 2.0, a platform for geometry-based audio rendering in 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalizing to novel environments, and offering configurable microphone and material properties. To our knowledge, it is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, through two downstream tasks covering embodied navigation and far-field automatic speech recognition, we highlight sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research on perception systems that can both see and hear.
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods for estimating RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs from a sparse set of images and echoes observed in the space. Toward this goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs for arbitrary query source-receiver locations through cross-attention. In addition, we design a novel training objective that improves the match between the acoustic signatures of the predicted and target RIRs. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
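A hedged sketch of the cross-attention query step: an acoustic-context memory (built elsewhere by self-attention over the sparse image/echo observations) is attended to by an embedded source-receiver pose to regress an RIR. Shapes, the pose encoding, and the waveform head are all assumptions:

```python
import torch
import torch.nn as nn

class RIRQueryDecoder(nn.Module):
    """Predict an RIR for an arbitrary source-receiver query by
    cross-attending to an acoustic-context memory."""
    def __init__(self, dim: int = 256, rir_len: int = 4000):
        super().__init__()
        self.pose_embed = nn.Linear(6, dim)   # (source xyz, receiver xyz)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                  nn.Linear(512, rir_len))

    def forward(self, memory: torch.Tensor, query_pose: torch.Tensor):
        # memory: (B, N, dim) tokens from a self-attention encoder over
        # the sparse observations; query_pose: (B, 6)
        q = self.pose_embed(query_pose).unsqueeze(1)   # (B, 1, dim)
        ctx, _ = self.cross_attn(q, memory, memory)    # attend to context
        return self.head(ctx.squeeze(1))               # (B, rir_len) waveform

dec = RIRQueryDecoder()
rir = dec(torch.randn(2, 32, 256), torch.randn(2, 6))
print(rir.shape)  # torch.Size([2, 4000])
```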
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound as though it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics suggested by the visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
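A minimal sketch (under assumed token shapes, not the paper's architecture) of audio-visual cross-attention for acoustic matching: audio tokens, e.g. spectrogram frames, attend to visual tokens from the target-environment image so that room properties are injected into the audio stream:

```python
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    """Cross-modal attention: audio queries, visual keys/values."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tok, visual_tok):
        # audio_tok: (B, Ta, dim) frames; visual_tok: (B, Nv, dim) image patches
        fused, _ = self.attn(self.norm(audio_tok), visual_tok, visual_tok)
        return audio_tok + fused  # residual audio stream, now room-aware

m = AudioVisualAttention()
out = m(torch.randn(2, 120, 256), torch.randn(2, 49, 256))
print(out.shape)  # torch.Size([2, 120, 256])
```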
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
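A sketch of the implicit-alignment idea, assuming precomputed tokens: both camera and LiDAR tokens receive position encodings derived from 3D coordinates and are concatenated into one sequence for a DETR-style decoder. The MLP encoders and dimensions are illustrative, and CMT's actual coordinate sampling (e.g., multiple depths per camera ray) is simplified away:

```python
import torch
import torch.nn as nn

class CrossModalTokenFusion(nn.Module):
    """Align modalities implicitly via shared 3D position encodings."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.img_pos = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pts_pos = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_tok, img_xyz, pts_tok, pts_xyz):
        # img_tok: (B, Ni, dim) camera features; img_xyz: (B, Ni, 3) 3D points
        # associated with each pixel's ray; pts_tok/pts_xyz: LiDAR equivalents.
        img = img_tok + self.img_pos(img_xyz)
        pts = pts_tok + self.pts_pos(pts_xyz)
        return torch.cat([img, pts], dim=1)  # one shared multi-modal sequence

fuse = CrossModalTokenFusion()
tokens = fuse(torch.randn(2, 100, 256), torch.randn(2, 100, 3),
              torch.randn(2, 200, 256), torch.randn(2, 200, 3))
print(tokens.shape)  # torch.Size([2, 300, 256])
```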
Knowledge graphs (KGs) have served as a key component of various natural language processing applications. Commonsense knowledge graphs (CKGs) are a special type of KG in which entities and relations are composed of free-form text. However, previous work on KG and CKG completion suffers from long-tail relations and newly added relations that do not have many known triples for training. In light of this, few-shot KG completion (FKGC), which draws on the strengths of graph representation learning and few-shot learning, has been proposed to address the problem of limited annotated data. In this paper, we comprehensively survey previous attempts at such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. We then systematically categorize and summarize existing work by KG type and method. Finally, we present applications of FKGC models to prediction tasks in different areas and share our thoughts on future research directions for FKGC.
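To make the task setup concrete, here is a hypothetical data layout for one FKGC episode: a long-tail relation comes with only K known support triples, and the model must rank candidate tails for a query head. All names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class FKGCEpisode:
    relation: str
    support: list[tuple[str, str]]   # K (head, tail) pairs for this relation
    query_head: str
    candidates: list[str]            # candidate tail entities to rank

episode = FKGCEpisode(
    relation="works_for",
    support=[("alice", "acme_corp"), ("bob", "globex")],  # 2-shot support set
    query_head="carol",
    candidates=["acme_corp", "globex", "initech"],
)
print(len(episode.support), "shot episode for relation:", episode.relation)
```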
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS and its incremental variants, and introduce a new framework named Reference Twice (RefT) that fully exploits the relationship between support and query features within a Transformer-like framework. Our key insights are two-fold: first, with the aid of support masks, we can generate dynamic class centers that more appropriately re-weight query features; second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our framework can be easily extended to incremental FSIS with minor modifications. Benchmarking on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shot counts, e.g., boosting nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method in the 10/30-shot settings. We further demonstrate the superiority of our approach on few-shot object detection. Code and models will be available.
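A sketch of the first "reference", assuming aligned feature maps: a class center is pooled from support features under the support mask, and query features are re-weighted by their similarity to it. The cosine-similarity weighting is an assumption; RefT's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def mask_weighted_class_center(support_feat, support_mask, query_feat):
    """support_feat: (B, C, H, W); support_mask: (B, 1, H', W') in {0, 1};
    query_feat: (B, C, H, W). Returns re-weighted query features."""
    # Pool a class center from the masked support region.
    mask = F.interpolate(support_mask, size=support_feat.shape[-2:], mode="nearest")
    center = (support_feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
    # Cosine similarity between each query location and the class center.
    sim = F.cosine_similarity(query_feat, center[..., None, None], dim=1, eps=1e-6)
    return query_feat * sim.unsqueeze(1)  # (B, C, H, W), class-aware weighting

q = mask_weighted_class_center(torch.randn(2, 256, 32, 32),
                               torch.randint(0, 2, (2, 1, 64, 64)).float(),
                               torch.randn(2, 256, 32, 32))
print(q.shape)  # torch.Size([2, 256, 32, 32])
```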
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive and therefore difficult to deploy on edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge distillation (KD) is a common solution for compressing GNNs, in which a lightweight model (the student) is encouraged to mimic the behavior of a computationally expensive GNN (the teacher). Nevertheless, most existing GNN-based KD methods lack fairness considerations; as a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle this problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. We then propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher or student model structure and can thus be easily adapted to various GNN-based KD frameworks. Extensive experiments on multiple real-world datasets corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
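For illustration only, a generic fair-KD objective, not RELIANT's actual debiasing mechanism: a standard distillation term plus a demographic-parity penalty on the student's predictions, with an assumed binary sensitive attribute and binary node classification:

```python
import torch
import torch.nn.functional as F

def fair_kd_loss(student_logits, teacher_logits, labels, sens_attr,
                 temp: float = 2.0, alpha: float = 0.5, beta: float = 0.5):
    """student_logits/teacher_logits: (N, 2); labels: (N,); sens_attr: (N,)
    binary sensitive attribute per node."""
    # Standard KD: student matches the teacher's softened distribution.
    kd = F.kl_div(F.log_softmax(student_logits / temp, dim=-1),
                  F.softmax(teacher_logits / temp, dim=-1),
                  reduction="batchmean") * temp ** 2
    ce = F.cross_entropy(student_logits, labels)
    # Demographic parity: match positive-prediction rates across groups.
    pos = F.softmax(student_logits, dim=-1)[:, 1]
    dp_gap = (pos[sens_attr == 1].mean() - pos[sens_attr == 0].mean()).abs()
    return ce + alpha * kd + beta * dp_gap

loss = fair_kd_loss(torch.randn(100, 2), torch.randn(100, 2),
                    torch.randint(0, 2, (100,)), torch.randint(0, 2, (100,)))
print(loss.item())
```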